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1  Introduction 


In  this  paper  we  are  concerned  with  the  analysis  problem  of  determining 
the  algorithm  executed  by  a  given  synchronous,  special-purpose,  multiple- 
processor  array.  The  problem  arises  because  such  arrays  (or  architectures) 
are  often  designed  heuristically.  Several  formulations  have  been  suggested 
in  the  computer  science  literature  to  solve  a  simpler  problem  called  ver¬ 
ification ,  in  which  one  wants  to  check  that  a  given  algorithm  is  indeed 
implemented  by  the  architecture. 

In  the  analysis  problem  we  are  given  the  topology  of  the  network,  the 
function  performed  by  each  processor  (including  timing  information),  and 
the  input  data  streams.  We  want  to  determine  the  algorithm  performed  by 
the  array,  and  the  iteration  interval  (».ev  the  time  between  two  consecutive 
input  samples). 

Previous  work  in  the  area  is  due  to  H.T.  Kung  and  C.E.  Leiserson  (1979); 
M.  Chen  and  C.  Mead  (1982);  H.  Lev-Ari  (1983);  H.T.  Kung  and  W.T.  Lin 
(1983);  C.J.  Kuo,  B.  Levy,  and  B.  Musicus  (1984);  E.  Tiden  (1984);  and, 
finally,  R.  Melhem  and  W.C.  Rheinboldt  (1984),  whose  work  is  the  most 
general  because  it  can  determine,  in  some  cases,  the  algorithm  implemented 
by  the  array,  instead  of  being  limited  to  verifying  the  algorithm  given.  In 
general,  these  methods  tend  to  be  somewhat  involved  and  are  applicable 


only  to  a  limited  class  of  systems.  The  main  contribution  of  this  paper  is  a 
new  approach  and  a  new  solution  to  the  analysis  problem,  based  on  ideas 
from  system  theory. 

We  view  the  analysis  problem  as  part  of  a  cycle:  starting  from  an  al¬ 
gorithm,  we  design  (or  synthesize)  a  physical  circuit;  then  we  complete  the 
cycle  (i.e.,  solve  the  analysis  problem)  by  trying  to  recover  the  original  al¬ 
gorithm  (or  one  that  it  is  equivalent  to  it  under  some  criterion).  The  first 
step  in  this  cycle  is  to  represent  a  given  iterative  algorithm,  i.e.,  a  set  of 
relations  between  sequences  of  data,  by  a  so-called  signal  flow  graph  (SFG), 
which  shows  interconnections  between  blocks  that  perform  ideal  mathemat¬ 
ical  operations  (i.e.,  take  no  time  to  compute).  The  next  step  in  the  cycle 
is  to  modify  the  chosen  SFG  to  obtain  a  logical  circuit  (i.e.,  a  hardware 
implementation  with  physical  modules  that  compute  the  same  functions  as 
the  blocks  in  the  SFG  but  in  nonzero  time  and  with  some  explicit  delays). 
For  the  analysis  problem  we  have  to  reverse  the  above  path  by  modifying 
the  logical  circuit  to  obtain  a  SFG,  and  thereby  an  associated  algorithm, 
equivalent  (in  some  sense)  to  the  one  we  started  with. 

Therefore  to  solve  the  analysis  problem  it  is  helpful  to  understand  the 
design  phase,  which  is  our  first  object  of  attention  in  this  section.  The 
whole  cycle  will  be  explored  in  some  detail  using  a  simple  example  from 


linear  system  theory;  in  fact  this  example  was  the  one  that  helped  us  to 
understand  the  analysis  problem  in  a  context  more  familiar  to  us,  since 
for  linear  systems  the  design  and  analysis  problems  are  well  understood 
and  there  are  well  established  techniques,  such  as  2-transforms  and  block- 
diagram-manipulations,  to  solve  them  (see,  e.g.,  Kailath  1980). 

1.1  Algorithms  and  SFGs 

Consider  the  iterative  expression 

t/(k)  =  blu(k-l)-aly(k-l)-hb2u(k-2)-a2y(k~2)  +  b3u(k-3)-a3y(k-3) 

(1) 

which  describes  the  relation  between  two  sequences,  u(»)  and  y(«),  that 
constitute  a  so-called  linear  filter.  This  filter  produces  a  sequence  of  output 
values  {y{k)j,  given,  at  each  k,  certain  past  values  of  y(*)  and  of  an  input 
sequence  u(»).  Representations  of  this  algorithm  using  simple  building 
-blocks — adders,  multipliers,  and  separators  (or  index-shifting  blocks) — can 
be  set  up  in  many  ways  (see,  e.g,  Kailath,  1980,  Ch.  2).  One  of  these,  the 
so-called  observer  form,  is  shown  as  a  signal  flow  graph  (SFG)  in  Figure  1, 
where  we  have  used  a  convention  (arising  from  the  use  of  what  are  called 
2-transforms)  common  in  system  theory  of  labeling  the  separator  blocks  by 
the  symbol  z_1. 
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Any  signal-flow-graph  is  a  network  of  connected  blocks.  The  intercon¬ 
necting  wires  propagate  sequences  of  data  elements,  which  we  shall  call 
variables.  The  points  at  which  variables  appear  will  be  called  nodes.  Thus, 
for  instance,  the  variable  xx (k)  denotes  the  sequence  of  data  elements  that 
appears  (for  k  =  0, 1, . . .)  at  the  output  (i.e.,  at  the  node  V|)  of  the  linear  fil¬ 
ter  in  Figure  1.  The  processors  (=blocks)  of  a  SFG  transform  one  or  several 
input  variables  into  a  single  output  variable.  In  general,  this  transforma¬ 
tion  need  not  be  linear.  The  set  of  all  variables  and  all  the  transformations 
determined  by  the  processors  constitutes  the  algorithm  performed  by  the 
SFG. 

Now,  with  the  important  convention  that  arithmetic  operations  are  in¬ 
stantaneous  (  i.e.,  the  input  and  output  quantities  have  the  same  indices), 
while  the  separators  (or  z~l  blocks)  shift  the  indices  by  unity,  we  can  write 
the  following  (so-called  “state”  *)  equations 

'The  values  {/,(/;), a-j(fc) }  describe  the  “state”  of  the  system  at  time  k,  in  the 
sense  that  knowing  them  and  {«(/),/  >  k}  we  can  compute  {y(!),l  >  A-}  irrespective  of 
the  prior  values  of  the  *,(•),  i.e.,  of  {xi(j)<  *2(j),  xs(k),  j  <  k}. 
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(2) 


xi (*)  =  &i u(fc  -  1)  -  a1z1(fc  -  1)  +  x2(Jfc  -  1) 

x2(A:)  =  62u(&  -  1)  -  a2x1(k  -  1)  +  x3(k  -  1) 

i3(A:)  =  bzu(k  -  1)  -  a2x^{k  -  1) 

y(k)  =  xt(k) 

Notice  that  these  state  equations  actually  represent  an  aggregated  SFG 
corresponding  to  the  modules  described  by  broken  lines  in  Figure  1.  Con¬ 
versely,  it  is  easy  to  see  how  to  draw  this  aggregated  SFG  from  the  equa¬ 
tions  (2). 

To  summarize  the  above  discussion,  we  can  say  that  generally  the  first 
step  in  obtaining  a  physical  implementation  is  to  start  with  an  input-output 
description  and  then  to  convert  it,  perhaps  via  the  intermediate  step  of 
constructing  some  (aggregated)  SFG  representation,  into  an  iterative  algo¬ 
rithm ,  which  is  a  set  of  equations  of  the  form 

=  fi  {x\{k  -  sn)}x2(k  -  s2l), . . .  ,xn(k  -  snl)} 

x2(k)  =  f2  {xt(k  -  sl2),  x2(k  -  s22), . . .  ,zn(k  -  sn2)} 

:  (3) 

xn{k)  —  fn  {Xi(k  —  sln),  x2{k  —  s2„), . . . ,  xn{k  —  snn)} 
where  k  is  the  (possibly  multidimensional)  index  of  iteration,  and  s,y  are 
known  as  the  index  displacements.  We  emphasize  that  this  conversion  pro- 
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cedure  is  highly  nonunique:  there  are  many  algorithms  that  can  imple¬ 
ment  a  given  input- output  map.  However,  there  is  a  one-to-one  correspon¬ 
dence  between  equations  (3)  and  the  corresponding  (aggregated)  signal-flow 
graph. 


1.2  Logical  circuits 

SFGs  are  not  truly  “physical”  implementations  of  mathematical  algorithms 
such  as  (2)  or  (3),  because  in  any  physical  hardware  implementation,  the 
arithmetic  operations  will  not  be  instantaneous.  One  way  to  accommodate 
these  physical  constraints  (and  to  interpret  the  SFG  as  a  physical  system)  is 
by  taking  the  iteration  interval  (i.e.,  the  physical  time  separation  between 
sequence  elements)  to  be  very  large,  so  that  the  arithmetic  operations  in 
each  computing  module  will  all  be  completed  before  the  next  iteration 
begins,  i.e.,  before  the  next  data  sample  is  entered  into  the  system.  A 
more  efficient  procedure,  likely  to  result  in  smaller  iteration  intervals,  is 
to  determine  a  “schedule”  of  the  times  at  which  each  operation  should  be 
performed,  as  explained  next. 

We  shall  confine  ourselves  to  digital  implementations,  in  which  we  have 
an  underlying  clock,  whose  period  will  be  taken  as  the  basic  time  unit.  Then 
the  time  required  for  additions  and  multiplications  (or  other  arithmetic 
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operations)  will  be  measured  as  integral  multiples  of  clock  cycles.  We  shall 
not  concern  ourselves  with  the  details  of  what  happens  within  any  particular 
clock  cycle. 

The  main  goal  of  the  scheduling  procedure  is  to  determine  an  appro¬ 
priate  iteration  interval,  i.e.,  the  physical  time  (measured  in  clock  cycles) 
between  two  consecutive  data  at  any  point  in  the  system  (this  will  be  the 
same  at  all  points  in  a  synchronous  system),  and  any  additional  delays 
required,  called  shimming  delays,  that  may  have  to  be  added  to  the  pro¬ 
cessing  and  transmission  delays  of  the  system  to  ensure  that  the  proper 
elements  in  the  various  sequences  are  interacting  correctly. 

Several  algorithms  for  scheduling  have  been  developed.  Here  we  briefly 
describe  the  ideas  of  Jagadish  et  al.,  1985  (see  also  Jagadish,  1985,  and 
Rao,  1985).  Figure  2(a)  shows  the  practical  modules  to  implement  the 
observer  canonical  form;  we  have  assumed  that  multiplication  (and  data 
transfer)  takes  7  clock  cycles,  addition  (and  transfer)  take  3  clock  cycles. 
We  shall  also  assume  that  a  pure  transfer  of  data  along  an  interconnecting 
wire  takes  1  clock  cycle.  The  z~i  block  is  a  conceptual  tool  used  to  express 
the  shifting  between  elements  of  sequences.  We  emphasize  that  such  in¬ 
dex  shifts  must  be  clearly  distinguished  from  physical  time  displacements; 
comparison  of  figures  1  and  3  (to  be  derived)  will  illustrate  this  point. 


Figure  2(b)  shows  how  a  schedule  can  be  computed  in  our  example. 
First,  break  the  circuit  at  the  z~l  blocks.  Second,  assign  to  the  input  the 
time  0;  keep  a  running  tab  of  the  delays  in  a  path  to  a  node  or  to  a  break 
in  the  circuit.  We  have  assigned  the  time  1  to  the  output  as  follows:  the 
input  data  arrives  at  the  third  adder  at  time  9;  therefore,  the  other  inputs 
to  this  adder  must  be  available  at  time  9;  working  backwards  we  find  that 
the  input  to  the  —at  gain  must  be  available  at  time  2,  thus  the  output 
of  the  global  system  must  be  ready  at  time  1.  We  can  now  compute  the 
times  at  the  outputs  of  gains  —  a2  and  — a3,  which  turn  out  to  be  10  and  11, 
respectively.2 

Third,  compute  the  difference  in  values  at  the  two  sides  of  the  breaks 
in  the  circuit:  we  have  14  —  10  =  4,  13  —  9  =  4,  and  12  —  1  =  11.  The 
largest  of  these  differences  will  define  an  appropriate  iteration  interval,  <5; 
in  our  example  5  =  11.  It  must  then  be  clear  that  we  must  add  some  so- 
called  shimming  delays  of  values  4  and  2  to  the  outputs  of  gains  b3  and  62 
2  la  graph  theoretical  terms  what  we  have  done  so  far  is  to  solve  a  single  source,  longest 
path  problem  for  the  SFG.  The  longest  path  to  a  node  clearly  determines  the  time 
at  which  the  node  can  be  ‘scheduled.’  To  be  precise,  certain  preliminary  separator 
transformations  have  to  be  carried  out  in  order  to  make  the  single-source  largest  paths 
problem  meaningful;  in  particular,  we  have  to  ensure  that  when  the  separators  are 
broken  we  have  an  acyclic  graph. 


Figure  3:  Logical  circuit  for  the  observer  canonical  form. 

to  ensure  that  all  the  inputs  to  each  adder  are  present  at  the  same  time 
(11  for  the  far  left  adder,  and  10  for  the  adder  in  the  middle).  Figure  3 
shows  these  shimming  delays  as  thin  black  blocks.  Now  close  the  gap  at 
the  breaks  as  follows:  when  the  time  difference  at  the  gap  is  equal  to  the 
iteration  interval,  close  the  gap  without  any  extra  delay;  otherwise,  close 
the  gap  with  a  shimming  delay  equal  to  the  iteration  interval  minus  the 
time  difference  in  that  gap. 

A  physical  implementation  of  the  observer  canonical  form  is  shown  in 
Figure  3,  in  which  the  blocks  represent  hardware  components  with  a  com¬ 
putational  delay  as  shown  in  Figure  2(a),  and  with  additional  (shimming) 
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delays  whose  value  (in  number  of  clock  cycles)  is  indicated  next  to  them. 
Digital  designers  usually  call  such  a  figure  a  logical  (circuit)  diagram.3  It 
should  be  noted  that  the  scheduling  procedure  is  highly  nonunique  (see  Ja- 
gadish,  1985,  and  Rao,  1985)  and  therefore  several  different  logical  circuit 
diagrams  can  be  associated  with  a  given  SFG.  We  shall  elaborate  on  this 
nonuniqueness  in  the  following  subsection. 

1.3  Logical  graphs  and  algorithm  graphs 

For  many  purposes,  especially  timing  analysis,  it  is  convenient  to  redraw 
the  logical  circuit  diagram  as  what  we  shall  call  a  logical  graph  G,  see 
Figure  4.  To  draw  G,  we  put  down  a  vertex  of  the  graph  for  each  node 
of  the  logical  circuit,  and  connect  the  vertices  with  edges  that  represent 
the  directed  paths  between  the  nodes  of  the  logical  circuit.  With  each 
edge  we  associate  a  weight  corresponding  to  the  total  computational  and 
propagation  delay  for  that  path.  For  example,  from  v2  to  vx  we  have  an 
edge  with  weight  3,  a  self  loop  from  vx  to  itself  with  weight  1+7  +  3=  11, 
a  loop  from  u,  to  v2  with  weight  l  +  l  +  7  +  3  +  7=  19,  and  so  on.  With 
each  vertex  v,,  we  associate  a  sequence  {x,}.  For  example,  for  the  observer 

sThe  adjective  'logical'  arises  from  the  fact  that  the  hardware  is  based  on  so-called 
logical'  components  obeying  the  rules  of  Boolean  logic  (algebra). 
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form  we  can  write  the  function  performed  at  each  vertex  as  follows 


zi()  =  6iu()  -  a^Q  +  x2() 
x2()  =  b2u()  -  a2zi()  +  ZiO 

(4) 

I3O  =  63u()  -  a3^l() 

y()  =  *i()- 

Note  that  we  cannot  write  directly  from  the  logical  graph  G  the  actual 
index  dependences  as  we  did  in  eq.  (2)  from  the  SFG.  Finding  these  index 
dependencies  (i.e.,  the  index  displacements  s,y  in  (3))  and  the  iteration 
interval  comprises  the  analysis  problem. 

Algorithm  graphs 

To  facilitate  the  reconstruction  of  an  iterative  algorithm  from  a  logical 
graph  G  of  some  physical  implementation,  we  have  to  obtain  a  represen¬ 
tation  of  iterative  algorithms  that  is  similar  in  form  to  the  logical  graph. 
For  this  purpose  we  introduce  a  so  called  algorithm  graph  G *:  this,  like  G, 
has  a  vertex  for  each  variable  in  the  algorithm,  and  its  edges  represent  the 
index  dependencies,  i.e.,  the  weight  of  the  directed  edge  connecting  vj  (the 
vertex  representing  xy())  to  v,-  equals  the  index  displacement  s,y .  Thus,  the 
algorithm  graph  is  a  precise  image  of  the  iterative  algorithm  (8),  i.e.,  there 
exists  a  one-to-one  correspondence  between  iterative  algorithms  and  their 
graphs.  The  difference  between  the  algorithm  graph  G*  and  the  SFG  is 
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that  in  G*,  we  focus  only  on  the  index  dependencies  ignoring  the  details 
of  the  actual  functional  computations.  It  is  the  determination  of  a  correct 
set  of  index  dependencies  from  the  timing  information  in  the  logical  graph 
that  is  the  key  to  the  solution  of  the  analysis  problem. 

Relating  G  and  G* 

The  logical  graph  G  and  the  algorithm  graph  G*  are  closely  related. 
Obviously,  G  and  G*  have  the  same  topology;  the  only  difference  is  that 
in  G*  the  edge  weights  represent  the  number  of  separators  in  the  path  while 
in  G  they  represent  physical  delay.  Moreover,  there  is  a  simple  relation 
between  the  index  displacement  s,y  (=number  of  separators)  associated 
with  an  edge  in  G*  and  the  physical  delay  dl;  associated  with  an  edge  in 
G.  To  make  this  relation  explicit  we  need  to  analyze  the  way  synchronous 
systems  work. 

In  synchronous  systems  the  time  between  two  consecutive  elements  in 
any  sequence  is  constant  (and  equal  to  the  iteration  interval,  <5).  If  in 
addition  we  assume  that  such  systems  are  time-invariant,  as  we  do  in  the 
logical  graphs,  all  the  computations  (at  the  vertices)  involve  data  arriving  at 
some  multiple  of  the  iteration  interval.  Consider,  for  example,  the  graph  G 
for  the  observer  form  (Figure  4).  Apply  the  first  element,  u(l),  of  an  input 
sequence  u(«)  at  time  A0  =  0;  by  definition,  the  rest  of  the  elements  will 
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be  generated  every  <5  clock  cycles.  Let  us  denote  by  A,  the  time  instant  at 
which  the  vertex  v,-  generates  the  output  x,(l). 


Recall  that  d,y  (resp.  S;y)  is  the  physical  delay  in  the  logical  graph  G 
(resp.  the  index  displacement  in  the  algorithm  graph  G*)  along  a  directed 
path  connecting  the  vertex  v,'  to  the  vertex  t/y.  Since  v,-  generates  x,(l)  at 
time  A,-,  x,(l)  arrives  at  vertex  t/y  at  time  A,-  +  d,y.  However,  we  cannot 
equate  Ay  to  A,  +  <5ty,  because  Ay  will  depend  upon  the  number  of  separators 
in  the  path  from  t/,-  to  t/y,  which  in  turn  will  fix  the  actual  iteration  in  which 
the  input  x,(l)  is  operated  on  at  vertex  t/y.  In  our  case,  the  path  from  v, 
to  Vj  has  s,y  separators  and  therefore,  the  vertex  t/y  will  associate  the  input 
x,(l)  with  the  (1  +  s,y)th  iteration  rather  than  with  the  first  iteration. 
Consequently,  t/y  will  generate  zy(l)  at  time  Ay  =  A,-  +  d,y  —  s,-y <5  where  6 
denotes  the  iteration  interval.  It  follows  that,  for  every  path  (t/j,t/y), 

dij  =  (Ay  —  A,)  +  S,y(5  (5) 

This  is  the  basic  equation;  all  logical  graphs  G  that  implement  a  given 
algorithm  graph  G*  satisfy  this  condition. 

After  this  discussion,  the  analysis  problem  can  now  be  restated  as  fol¬ 
lows:  given  a  logic  graph  G,  find  an  algorithm  graph  G*  and  a  set  of  A,-4 

4Ouly  the  A,  associated  with  the  input  vertices  are  known  apriori. 
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that  satisfy  (5). 


1.4  An  analysis  procedure 

We  now  present  an  analysis  procedure  that  constructs  nonnegative  s,y  and 
a  set  of  A ,  for  a  given  logical  graph  G.  First  we  form  a  rooted  tree  consisting 
of  the  shortest  paths  from  the  input  vertex  v0  to  the  remaining  vertices  of 
G.  Next  we  set  s,y  =  0  for  the  edges  contained  in  the  rooted  tree.  This 
determines  A,-  for  all  vertices  and  we  can  use  equation  (5)  to  compute  the 
index  displacements  s,y  for  the  edges  that  are  not  in  the  rooted  tree. 

As  an  example,  in  the  observer  form  of  Figure  4  the  first  input  will 
reach  vertex  v2  from  v0  at  time  A2  =  20  (minimum-delay  path).  Along  the 
path  that  goes  through  v3,  however,  this  first  input  will  not  reach  v2  until 
time  31  (i.e.,  21  4-  10).  But  at  time  31,  we  shall  already  have  at  vertex  v2 
the  second  datum,  x2(2),  that  traveled  along  the  shortest  path.  In  other 
words,  x2(2)  is  a  function  of  u(2)  and  x3(l).  Hence,  we  have  to  put  one  z~l 
block  in  the  longer  path.  Figure  6  shows  the  graph  G*  with  the  number  of 
separators  in  each  edge;  from  this  graph  we  can  easily  write  the  following 
state  equations 


v0 


Figure  6:  Another  graph  G *  for  observer  canonical  form. 

Xi(k)  —  biu(k)  —  ciiX^k  —  1)  +  x2{k  —  1) 
z2(A;)  =  b2u(k)  —  a2Xi(k  —  1)  +  xx(k  —  1)  (b) 

x3(fc)  =  b2u(k)  -  azXi[k  -  1). 

When  the  logical  graph  has  several  input  vertices  we  shall  need  to  extend 
the  logical  graph  by  introducing  an  auxiliary  root,  as  described  in  Section  3. 

We  now  observe  that  the  graphs  G*  in  Figures  6  and  5  are  different,  and 
consequently  that  the  resulting  algorithms  (eqs.  (2)  and  (6))  are  apparently 
different.  However,  the  difference  in  the  algorithms  in  eqs.  (2)  and  (6) 
amounts  only  to  shifts  in  the  indices  (we  can  change  the  indices  for  the 
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sequence  u(#)  in  eq.  (6)  from  k  to  k  -  1  and  obtain  eq.  (2))  and  therefore 
both  algorithms  are  equivalent  in  the  sense  that  they  determine  the  same 
set  of  recursions. 


This  observation  holds  for  every  algorithm  graph  G*  and  for  arbitrary 
shifts  in  the  indices  of  data  sequences.  In  other  words,  if  we  redefine  x,(fc)  := 
Xi(k-Cj)  and  replace  i,(  )  by  x,()  in  the  algorithm  graph,  the  computations 
remain  unchanged,  even  though  the  index  displacements  become  modified, 
i.e.,  Sjj  — >  Sij  +  Ci  —  Cj.  We  shall  say  that  two  algorithm  graphs  are  equivalent 
if  they  are  isomorphic  (in  the  graph  theoretic  sense)  and  if  their  index 
displacement  can  be  related  by  a  set  of  shifts  c,,  as  described  above. 

The  above  discussion  has  been  informal.  We  shall  present  and  prove  the 
general  procedure  to  obtain  the  algorithm  graph  G*  in  Section  3.  It  is  based 
on  determining  the  shortest-path  tree  (minimum  delay),  which  is  consistent 
with  the  iemark  in  footnote  2  that  the  inverse  problem  (viz.  the  schedul¬ 
ing  problem)  involves  solving  a  single-source  longest-paths  problem).  The 
shortest  path  procedure  guarantees  that  the  number  of  separators  we  obtain 
in  the  links  of  the  tree  is  integral  and  positive. 
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2  Formulation  of  the  Analysis  Problem 


In  this  section  we  discuss  in  more  detail  some  of  the  key  concepts  presented 
in  Section  1.  The  first  subsection  is  dedicated  to  presenting  a  model  for 
synchronous,  special-purpose  systems.  In  the  second  subsection  we  refine 
the  concept  of  iteration  interval.  In  the  third  subsection,  we  formalize  the 
concept  of  equivalence  for  a  pair  of  SFGs  that  have  the  same  topology 
and  functionality,  but  that  differ  in  the  weights  on  their  edges  (different 
distribution  of  separators). 

2.1  Logical  Graphs 

As  it  is  well  known  in  circuit  theory  (see,  e.g.,  Rohrer,  1970),  physical  cir¬ 
cuits  can  be  modeled  by  a  directed  graph.  For  instance,  Leiserson  and  Saxe 
(1981)  modeled  circuits  as  a  graph  in  which  the  vertices  are  the  computing 
elements  and  the  edges  the  interconnections  between  these  elements.  The 
edges  have  weights  associated  with  them,  representing  the  number  of  shift 
registers  present  in  the  interconnection.  Moreover,  the  vertices  have  also 
weights  associated  with  them,  representing  the  computational  delay. 

To  model  a  synchronous  circuit,  Leiserson,  Rose,  and  Saxe  (1983)  im¬ 
pose  non-negativity  constraints  on  the  weights  associated  with  the  edges 
and  vertices.  In  addition  they  require  that,  in  any  directed  cycle  in  the 
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graph,  the  sum  of  the  edge  weights  must  be  positive.  The  need  for  such 
a  constraint  for  digital  circuits  was  shown  by  Crochiere  and  Oppenheim 
(1975);  see  also  previous  work  by  Karp  and  Miller  (1966). 

Below  we  model  a  synchronous,  special-purpose  system  as  a  finite,  di¬ 
rected  graph.  For  our  needs,  we  lump  the  propagation  and  computation 
time  in  a  weight  associated  with  the  edges,  which  represent  the  interconnec¬ 
tions  between  variables  or  sequences  (which  are  the  vertices  of  the  graph). 
We  impose  positivity  constraints  on  the  edge  weights  as  in  the  previous 
models.  To  describe  synchronism,  we  associate  a  data  sequence  to  each 
vertex  and  impose  the  constraint  that  the  time  between  two  consecutive 
data  in  all  sequences  must  be  a  constant,  which  we  call  iteration  interval. 

Definition  1  A  synchronous,  special-purpose  system  is  a  finite,  directed 
graph  G  =  (V,  E,  x,  d,  8),  in  which 

•  V  is  the  set  of  vertices  of  the  graph.  They  represent  the  sequences  (or 
variables)  in  the  circuit. 

•  There  is  at  least  one  vertex  in  V  (called  a  source)  with  no  incoming 
edges. 

•  x  is  the  set  of  sequences  {x,}  associated  with  the  vertices  (one  per 
vertex) 

•  E  is  the  set  of  directed  edges,  which  represent  the  dependence  among 
the  variables  so  that  for  vertex  v,-;  z,(fc)  is  a  function  (invariant  in  k) 
of  the  sequences  corresponding  to  the  vertices  that  have  edges  incoming 


•  d  is  the  set  of  positive  weights  (delays)  associated  with  the  edges  (one 
per  edge),  which  represent  the  processing  and  propagation  delay  cor¬ 
responding  to  that  edge. 

•  6  is  a  positive  constant  (called  iteration  interval)  that  represents  the 
time  between  two  consecutive  elements  in  the  sequence  associated  with 
any  vertex  u,-. 

Remarks: 

1.  The  inputs  of  the  system  are  the  sequences  associated  with  the  source 
vertices.  The  outputs  are  the  sequences  corresponding  to  sinks  in  the 
graph  (t.e.,  vertices  with  no  outgoing  edges);  in  addition,  we  can 
designate  the  sequence  associated  with  any  vertex  as  an  output  of 
the  system.  When  we  want  to  show  clearly  that  we  are  referring  to 
an  input  or  output,  we  shall  denote  the  corresponding  sequences  with 
the  letters  u  and  y,  respectively.  A  subindex  indicates  the  vertex 
we  are  referencing;  an  integer  in  parenthesis  designates  a  particular 
element  in  the  sequence.  For  example,  x3(7)  is  the  seventh  element 
in  the  sequence  generated  at  vertex  v3. 

2.  In  our  model,  the  delays,  <f,y,  and  the  iteration  interval,  5,  are  integer 
multiples  of  the  clock  period. 

3.  We  use  the  convention  of  assigning  the  same  index  k  =  1  to  the  first 
element  generated  at  each  of  the  vertices  vt. 
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4.  Till  we  state  the  contrary,  we  assume  that  every  vertex  in  G  can  be 
reached  by  a  directed  path  from  at  least  one  input. 

2.2  The  Iteration  Intervals 

There  are  several  possible  notions  of  iteration  interval  for  a  synchronous 
system  G.  The  first  one,  <5,  is  the  time  between  two  data  in  the  input 
streams.  As  we  shall  see  below,  there  might  be  several  possible  6’s  that 
work  (are  valid)  in  a  system;  we  establish  the  conditions  for  validity  in 
Theorem  1  below. 

Under  certain  conditions  (see  Jover  and  Kailath,  1984),  a  system  can 
perform  an  algorithm  on  different  sets  of  data  by  interleaving  data  from 
the  different  sets  to  form  the  input  sequences  for  the  system.  Naturally,  in 
that  case  the  throughput  increases  to  a  value  given  by  the  number  of  sets 
interleaved  times  the  throughput  without  interleaving. 

The  second  type  of  iteration  interval  will  be  denoted  6i  and  refers  to 
the  time  between  data  in  a  sequence  such  that  the  system  is  operating  on 
one  sequence,  not  on  interleaved  sequences. 

There  exists  a  third  iteration  interval,  called  the  intrinsic  iteration  in¬ 
terval,  S„tl,  which  corresponds  to  the  minimum  time  interval  before  we  can 
introduce  another  datum  to  the  same  elementary  module.  Note  that 
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can  be  the  clock  period  or  a  multiple  of  it  depending  on  the  implementation. 
In  a  practical  system  we  let  (5,n<  be  the  largest  of  the  intrinsic  iteration  inter¬ 
vals  of  any  elementary  module.  Obviously,  a  practical  system  must  satisfy 
8  >  hm  and  >  clock  period ,  since  we  cannot  introduce  data  faster  than 
the  clock  period  of  a  system.  Finally,  note  that  <5,„,  is  technology  and  im¬ 
plementation  dependent,  while  8  depends  only  on  the  topology  and  delays 
in  the  graph  G. 

Below  we  present  some  theorems  concerning  the  iteration  intervals.  We 
say  that  a  graph  G  has  a  valid  8  when  a  given  value  6  for  the  iteration 
interval  satisfies  the  definition  of  a  synchronous  system  (and,  of  course, 
6  >  The  following  theorem  provides  a  method  to  test  if  a  graph  G 

has  a  valid  <5,  i.e.,  an  iteration  interval  that  makes  the  appropriate  elements 
of  the  sequences  to  meet. 


Theorem  1  1.  A  given  value  6  is  a  valid  iteration  interval  for  a  syn¬ 

chronous,  special-purpose  system  G  =  ( V,E,x,d,8 )  if,  and  only  if, 
the  delays  on  all  the  paths  between  any  two  vertices  are  congruent 5 
mod  8  and  the  delays  for  self-loops  are  a  multiple  of  8. 

5Two  numbers  Oi  and  fl2  are  defined  to  be  congruent  modulo  6  if  their  difference  is  a 
multiple  of  fi. 
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2.  8x/8  represents  the  number  of  sequences  interleaved,  where  8X  is  the 
largest  valid  iteration  interval. 

Proof: 

1.  Consider  one  vertex  v,-,  and  assume  that  there  are,  say,  r  paths  be¬ 
tween  this  vertex  and  another  vertex  t/y.  Call  the  delays  on  the  paths 
dx,...dr.  If  members  of  the  sequence  i are  sent  every  5  units  of  time 
starting  at  time  t  =  0,  then  the  member  z,(A;)  will  reach  vy  at  the  fol¬ 
lowing  times  t !  =  k6  +  dx ,  t2  =  k6  +  d2  and  so  forth  until  tr  =  k6  +  dr. 
Since  the  delays  are  congruent  modulo  6,  their  difference  is  a  multiple 
of  5.  Thus  the  differences  between  any  times  {£X|  ...tr}  is  also  a  mul¬ 
tiple  of  S,  which  proves  that  the  iteration  interval  for  the  sequence 
in  Vj  is  also  6.  The  same  argument  applies  for  self-loops.  Since  our 
discussion  was  for  any  v,-  and  t/y,  our  results  are  general  for  any  two 
vertices.  Finally,  note  that  if  we  were  given  the  fact  that  the  8  is  valid 
(i.e.,  constant  for  the  sequence  at  any  vertex),  then  the  differences 
in  {tl,...tr}  are  multiples  of  8,  which  implies  that  the  differences  in 
the  delays  {dx,...dr}  are  also  a  multiple  of  8  and  therefore  congruent 
modulo  8.  This  completes  the  proof  of  (1). 

2.  Assume  a  system  with  only  one  pair  of  vertices  connected  by  two  par¬ 
allel  edges  in  the  same  direction.  For  such  a  system,  it  is  clear  that  8X 
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is  given  by  the  difference  in  the  delays  of  both  paths.  Choosing  an 
iteration  interval  that  is  a  submultiple  of  Sl  will  result  in  indepen¬ 
dent  sequences  being  processed  ( i.e interleaving).  The  number  of 
sequences  being  <5j/<5  as  it  is  trivial  to  check. 

For  a  general  system,  if  there  is  a  pair  of  vertices  for  which  the  itera¬ 
tion  interval  is  not  the  largest  valid,  then  there  will  be  interleaving  for 
that  pair  and,  therefore,  this  value  of  iteration  interval  will  not  be 
Thus,  has  to  be  the  largest  valid  iteration  interval,  which  proves  the 
first  property.  Similarly,  for  the  number  of  sequences  interleaved  in  a 
pair  of  vertices  will  be  given  by  the  quotient  between  Si  and  the  value 
that  makes  this  pair  congruent.  Obviously,  for  the  general  system, 
the  largest  of  this  quotients  is  the  number  of  sequences  interleaved  in 
the  whole  system. 

□ 

2.3  Algorithm  Graphs 

Definition  2  An  algorithm  graph  is  a  finite  directed  graph  G‘  =  (V,  E,  x,  s) 
in  which  V,  E,  x  are  the  same  as  in  Definition  1,  and  s  is  the  set  of  pos¬ 
itive  weights  (separators)  associated  with  the  edges  (one  per  edge),  which 
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represent  the  index  displacement  corresponding  to  that  edge. 

Theorem  2  A  synchronous  system  G  =  (V,  E,x,d,6)  is  an  implementa¬ 
tion  of  an  algorithm  graph  G *  =  ( V ,  E,  x ,  5)  if,  and  only  if,  there  exists  a  set 
of  positive  constants  {A,}  ( one  per  vertex)  such  that  for  every  t/,-,  €  V , 

Ay  =  A,'  +  d{j  S  ,y  <5  — ' 

Proof:  See  the  argument  used  to  establish  eq.  (5)  □ 

The  constants  {A,}  determine  a  schedule  for  G  in  the  sense  that  x,(fc) 
is  generated  at  the  time-instant  A,-  4-  {k  —  1)<S. 

Definition  3  Two  algorithm  graphs  G*0  =  ( V ,  E,  x,  s„ )  and  G\  =  (V,  E,  x,  sb) 
(and  their  corresponding  algorithms)  are  said  to  be  equivalent  if  there  exists 
a  set  of  constants  {c,  }  (one  per  vertex)  such  that 

Cj  (7) 

Figure  7  depicts  two  equivalent  graphs  G *  for  the  aggregated  observer 
form  introduced  in  Section  1.  The  main  point  is  that  two  algorithms  can 
be  equivalent  even  if  the  indices  do  not  seem  to  match.  The  important 
fact  is  to  have  the  indices  for  the  same  sequence  related  in  the  same  way. 
Indices  between  different  sequences  can  always  be  matched  by  selecting  an 
appropriate  origin  for  the  different  indexing  variables. 
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3  An  Analysis  Procedure 


In  this  section  we  describe  our  analysis  procedure  and  provide  a  rigorous 
proof  of  its  validity. 

Procedure  1  Given  a  logical  graph  G  —  (V,  E,x,d,6)  we  want  to  deter¬ 
mine  an  algorithm  graph  G*  —  ( V,E,x,s ),  such  that  G  is  an  implemen¬ 
tation  of  G* .  For  each  input  sequence,  {tt,},  we  are  given  the  time  A,  at 
which  the  first  element  of  the  sequence  is  entered.  The  procedure  consists 
of  the  following  steps: 

1.  Extend  the  graph  G  by  adding  a  vertex  vQ  with  an  edge  from  v0  to  sad: 
of  the  source  vertices  v,,  and  with  weight  A,.  Call  this  graph  Gejt. 

2.  Form  a  spanning  tree  with  the  minimum-delay  path  from  v0  to  every 
vertex  in  Geil.  For  every  vertex  Vy  let  Ay  =  A,  -f  d,y  where  d,j  is  the 
delay  on  the  edge  that  connects  to  t»y  its  closest  vertex,  v,. 

S.  For  each  link  e ,-y  in  the  tree  formed  above  compute 

d*j  =  A,  —  Ay  +  dij. 

For  each  self-loop  with  weight  d,»  assign 

d:,  -  d„. 
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Figure  8:  Shortest-path  tree  for  observer  form. 

4-  Compute  <5i  =  gcd{d Vi,;}. 

5.  The  number  of  sequences  interleaved  is  given  by  5i/5.  //(<5i/<5)  ^  Z+ 
then  the  6  given  is  not  valid  for  the  graph  G. 

6.  Associate  a  weight  Sij  >  0  to  each  link  eij  in  the  tree  formed  in  step  2 

•a  =  <*:>/«.• 

Associate  zero  weight,  s^  =  0,  to  the  edges  in  the  tree. 

As  a  simple  example  recall  the  observer  form  of  Section  1.  Notice  that 
an  extension  (step  1)  is  not  required,  because  A0  =  0;  the  corresponding 
shortest-path  tree  and  the  resulting  index  displacement  are  described  in 
Figure  8. 

Remarks: 
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1.  In  addition  to  the  logical  graph  G,  we  are  given  the  time  instants  A,- 
at  which  the  first  element  from  each  of  the  input  sequences  {u,}  is  en¬ 
tered.  If  these  time  instants  were  not  given,  the  procedure  would  still 
work,  but  information  on  the  relative  shift  between  input  sequences 
could  not  be  recovered.  (As  we  noted  in  Section  2.3  the  SFG  recov¬ 
ered  will  still  be  equivalent  to  the  one  used  in  the  design  of  the  logical 
graph  G .) 

2.  If  the  iteration  interval  6  is  not  given,  our  procedure  still  will  work 
and  will  compute  a  valid  6  for  that  system,  under  the  assumption 
that  there  is  no  interleaving  of  input  sequences.  In  this  case  there  is 
no  need  to  perform  step  5  of  the  procedure. 

3.  There  are  several  algorithms  for  implementing  step  2  of  the  procedure 
(shortest-path  tree),  such  as  those  in  Dantzig  (1975).  For  a  compre¬ 
hensive  presentation  of  these  algorithms  see,  e.g.,  Even  (1979). 

We  now  turn  to  establish  the  validity  of  our  analysis  procedure. 

Proof  of  the  Analysis  Procedure 

We  establish  Procedure  1  in  three  stages.  First,  to  prove  steps  1  through  5 
of  the  procedure  we  prove  that  the  procedure  determines  whether  the  5 
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given  is  valid  and  that  it  computes  8t  correctly.  Second,  we  prove  step  6 
of  the  procedure,  i.e.,  that  the  number  of  states  assigned  to  a  link  will  be 
nonnegative.  Third,  we  prove  that  the  procedure  recovers  an  algorithm  that 
is  equivalent  to  the  original  algorithm  used  in  generating  the  synchronous 
system. 

Computing  a  valid  6 

In  this  section  we  prove  that  steps  1-5  of  Procedure  1  are  correct.  We  will 
prove  first  that  a  valid  8  is  a  divisor  of  the  delays  {d*;  Vi,;}.  Every  time 
we  add  a  link  e,y  from  t»,  to  Vj,  we  are  creating  two  paths  to  reach  v}  (see 
Figure  9);  one  of  the  paths  has  delay  A,-  +  d,;  and  the  other  Ay.  Applying 
Theorem  1  we  see  that  these  delays  must  be  congruent  modulo  <5;  therefore, 
their  difference  (which  we  call  d*y)  must  be  a  multiple  of  6. 

To  prove  that  6i  =  gcd{d‘ ■  Vi,;}  just  consider  that  any  valid  8  has  to 
be  a  divisor  of  {d'{-}.  By  Theorem  1,  <5X  must  be  the  largest  valid  6,  which 
completes  our  proof.  D 

Determining  the  states 

In  this  section  we  prove  that  step  C  of  Procedure  1  generates  index  dis¬ 
placements  that  are  positive  and  integer. 


To  prove  that  s,y  >  0  note  that  s,y  will  be  nonnegative  if  and  only  if 
>  0.  Since  we  have  formed  a  shortest-path  tree  from  v0,  it  means  (again 
see  Figure  9)  that 

+  dij  > 

and,  therefore, 

d"j  —  A,  —  A j  +  dfj  ^  0. 

Thus  s,y  >  0.  Finally,  since  <5X  is  a  divisor  of  every  d*  ,  it  is  obvious 
that  s,; ,  as  computed  in  Procedure  1,  will  also  be  integer.  □ 

Recovering  the  equations 

In  this  section  we  prove  that  Procedure  1  recovers  an  algorithm  G*  that  is 
equivalent  to  the  equations  used  in  the  design  of  the  logical  graph  G. 

The  algorithm  G*  recovered  by  our  procedure  satisfies  the  equation 
Ay  =  A,  4-  d,y  -  s,y <5 ,  while  the  original  algorithm  G*,  used  to  design  the 
logical  circuit  G,  satisfies  the  equation  Ay  =  A,-  +  d,y  —  s,y<5.  Assuming  all 
graphs  in  consideration  have  been  extended,  we  can  assume  for  the  root 
vertex  va  that  A0  =  0  =  A0.  Consequently, 

A,  —  A,  =  (s,o  —  s,o)5 

so  that  we  can  replace  A,  by  A,  +  c,<5,  where  c,  :=  sl0  -  s,o,  and  we  conclude 
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that 


Sjj6  —  A,  Ay  +  d,j  —  A,-  Ay  +  d,]  +  (c,'  C})5  —  (s,y  +  c,-  Cy)(5 

which  establishes  the  equivalence  between  G *  and  G *  □ 

This  completes  the  proof  that  our  procedure  is  indeed  correct  in  each 
of  the  steps  and  that  it  recovers  an  algorithm  equivalent  to  the  one  used  in 
the  design  of  the  system. 
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4  Concluding  Remarks 


We  modeled  a  synchronous,  special-purpose  system  as  a  directed  graph  in 
which  the  functions  performed  by  the  processors  were  associated  to  the 
vertices,  and  the  computational  and  propagation  delays  to  the  edges.  We 
discussed  extensively  the  motivation  for  viewing  the  analysis  problem  as  a 
conversion  from  this  graph  into  a  signal  flow  graph,  and  we  developed  a 
procedure  that  allowed  us  to  transform  this  graph  into  a  signal  flow  graph 
with  the  same  topology  as  the  original  graph,  but  in  which  the  weights  in 
the  edges  correspond  to  the  states  of  the  system.  This  conversion  is  the 
reverse  step  to  the  design  process.  Once  we  have  the  states  of  the  system,  we 
can  write  by  inspection  state-like  equations,  and,  therefore,  the  algorithm 
implemented  by  the  system.  Our  procedure  recovers  an  algorithm  that  is 
“equivalent”  to  the  one  the  system  was  designed  to  implement.  We  formally 
proved  our  procedure  and  gave  several  examples  of  how  to  use  it. 

Our  analysis  can  be  extended  to  systems  in  which  not  all  vertices  are 
reachable  from  at  least  one  source,  thereby  removing  the  constraint  imposed 
by  Remark  4  to  Definition  1  (in  Section  2.1).  It  can  be  shown  that  the 
logical  graph  of  such  systems  always  contain  autonomous  subgraphs  that 
are  not  reachable  from  the  rest  of  the  graph.  In  order  to  apply  our  analysis 
procedure  we  need  first  to  add  edges  from  the  root  v0  to  a  vertex  (which  is 
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not  a  sink)  in  every  unreachable  subgraph.  This  modification  makes  every 
vertex  reachable  from  the  root  v0  so  that  our  analysis  procedure  can  be 
applied.  A  detailed  description  and  justification  of  this  technique  can  be 
found  in  Jover  (1985). 

Finally,  it  should  be  noticed  that  our  procedure  can  be  significantly 
simplified  when  the  system  to  be  analyzed  is  a  systolic  array ,  i.e.,  a  regular 
network  of  identical  modules.  It  can  be  shown,  in  fact,  that  the  analysis  of 
such  systems  involves  the  application  of  our  procedure  to  a  single  module. 
More  details  can  be  found  in  Jover  (1985). 
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Appendix:  EXAMPLES 


In  this  appendix  we  illustrate  our  procedure  by  analyzing  the  arrays  previ¬ 
ously  verified  or  analyzed  in  the  literature.  Most  of  the  authors  limit  their 
efforts  to  the  Kung-Leiserson  matrix- multiplication  array;  the  most  com¬ 
prehensive  effort  was  made  by  Melhem  and  Rheinboldt  (1984),  who  studied 
three  additional  examples:  reverse  convolution,  sorting,  and  back  substi¬ 
tution.  We  analyze  these  three  examples  below.  Note  that  the  examples 
studied  by  the  previous  authors  always  assume  that  all  the  operations  take 
one  unit  of  time  to  perform  (in  our  language,  that  the  computational  de¬ 
lays  are  unity  for  all  the  edges  of  G)\  to  make  a  different  assumption  might 
complicate  their  methods.  In  contrast,  we  can  assign  “realistic”  values  for 
all  the  computational  delays,  since  this  does  not  complicate  our  analysis 
procedure  at  all. 

We  also  analyze  the  Heller-Ipsen(1983)  array  for  QR  factorization,  which 
was  studied  by  Tiden  (1984),  as  an  example  of  a  complex  system  with  reg¬ 
ular  topology. 

Forward  Convolution 

This  array  was  developed  by  Rung  and  Leiserson  (1979).  Figure  10  depicts 


(-)  (‘) 

Figure  10:  System  that  performs  convolution:  (a)  network,  (b)  module  and 
timing. 

sequences  are 

Ui  =  {u1(0),u1(l),u1(2),...}  (8) 

«a  =  {0,0,0,...}  (9) 

We  assume  that  Aj  =  0  and  that  we  know  that  A8  =  7,  and  the  iteration 
interval  is  given  as  2. 

Following  steps  1-2  of  the  analysis  procedure,  we  determine  the  shortest 
path  between  v0  and  the  rest  of  the  vertices.  In  practice,  there  is  no  need  to 
write  un  since  it  is  enough  to  write  the  A’s  to  keep  track  of  the  shortest  path 
as  the  A’s  correspond  to  the  shortest  distance  from  u0  to  a  given  vertex. 
Figure  12  shows  one  choice  for  the  shortest-path  tree.  (Note  that  there 
is  another  choice  for  the  shortest-path  tree:  take  eg,7  as  part  of  the  tree 
and  e17  as  a  link.) 
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In  steps  3-6  of  the  analysis  procedure  we  compute  d?c  =  =  2  and 

<5X  =  gcd{ 2}  =  2;  therefore  there  is  no  interleaving  of  sequences  (5x/<5  =  1). 
Furthermore,  the  number  of  states  in  each  of  the  links  is  given  by  s78  = 
s65  =  1  and  s87  =  0. 

Finally,  we  can  write  the  equations  directly.  Note  that  the  equations 
for  vertices  u5  and  v6  reflect  the  fact  that  there  is  a  state  on  edges  e76  and 

^65* 

x2(k)  =  u^k) 
x3(k)  =  x2(k ) 

(10) 

x0(k)  =  ulx2{k)  +  x 7(k  -  1) 

z7(fc)  =  w2Uj(k)  +  u8(k) 

Substituting  these  equations  into  the  following  equations  for  the  outputs 


y*  {k)  =  z3(fc) 

ys{k)  =  w0x3(fc)  +  x*{k  -  1) 


(11) 


we  obtain  (after  substituting  for  the  input  sequences,  eqs.  (8)  and  (9) 


y*{k)  =  ui{k) 

ys(fc)  =  w0«i(fc)  +  wi ui(fc  -  1)  +  u2Ui(k  -  2). 


(12) 


The  last  equation  y$(k)  —  w,-U|(fc  —  t)  is  the  output  of  a  convolution 
system.  Thus  we  have  determined  the  algorithm  performed  by  the  network 
and  the  iteration  interval  for  solving  a  single  instance  of  the  problem,  <5j  =  2. 


In  addition,  we  have  seen  that  the  <5  given  in  the  statement  of  the  problem 
(<5  =  2)  was  valid  for  that  network;  we  have  also  seen  that  the  network,  as 
given,  is  not  interleaving  input  data  from  different  sequences. 

Reverse  Convolution 

This  array  was  developed  by  Kung  and  Leiserson  (1979).  We  call  it  a 
“reverse”  convolution  array,  because  the  output  sequences  are  generated  at 
opposite  ends  of  the  array.  Figure  13  depicts  the  array,  the  graph  G,  and 
a  shortest-path  tree  based  on  v0-  (For  simplicity,  we  do  not  show  the  root, 
v0,  which  has  two  edges:  the  first  one  connects  v0  to  Vi  with  delay  0;  the 
second  one  connects  v0  to  v5  with  delay  9.) 

The  input  sequences  are  zeros  at  v5  and  (u(0),  u(l),  u(2), . . .}  at  vertex 
Vi,  and  the  output  sequences  are  (y(0),  y(l),  y(2), . . .}  at  v&  and,  at 
(u(-2),  ti(-l),  tt(0),  u(l),  u(2),  •  •  ■}  with  u(- 2)  and  u(-l)  supposed  to  be 
known  (they  are  called  initial  conditions).  The  iteration  interval  is  given 
as  4. 

This  system  differs  from  the  forward  convolution  in  three  aspects:  rever¬ 
sal  in  inputs  and  outputs,  reversal  in  the  order  of  the  gains  u/,,  and  double 
iteration  interval.  The  analysis  proceeds  as  for  the  forward  convolution: 
the  graph  G*  has  all  the  weights  zero  except  for  s67  =  s78  =  1.  In  addition, 
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6{  —  gcd{A }  =  4  so  the  system  has  a  valid  <5  and  none  of  the  input  sequences 
is  interleaved.  The  system  equations  are  given  below 


x2(k)  =  uj(fc) 

Xi(k)  =  x2[k) 

x^{k)  —  u2x2(k)  +  Us(k) 

x7(k)  =  ulx2{k)  +  x0(k  -  1) 

Substituting  these  equations  into  the  following  equations  for  the  outputs 

y*(k)  =  x3(k) 

y&{k )  =  w0Ui(fc)  +  x 7(k  -  1) 
we  obtain  (after  substituting  for  the  input  sequences) 

y*(k)  =  Ui[k) 

y&{k)  ~  WqUi(A:)  +  u^Ui^k  —  1)  ■+-  u/2Ui(k  —  2) 

The  last  equation  y$(k)  =  J2i=ouiui(k  -  0  *s  output  of  a  convolution 
system  and  it  is  the  same  equation  that  we  obtained  for  the  forward  con¬ 
volution. 

Sorting 

This  array  was  developed  by  H.T.  Rung  and  first  reported  and  verified  by 
Melhem  and  Rheinboldt  (1984).  The  system  sorts  a  sequence  of  n  real  num- 


bers,  Uj  =  (u^l),  ux(2), . . . ,  iq(A:)}  by  using  the  linear  array  of  n—  1  proces¬ 
sors  depicted  in  Figure  14.  The  output  sequence,  y8  =  {yi(l),yi(2), . . .  ,y i{k)} 
is  sorted  in  ascending  order.  In  this  example,  we  assume  all  the  computa¬ 
tional  delays  to  be  equal  and  of  value  10  since  all  the  operations  are  well 
balanced  and  should  take  the  same  amount  of  time.  The  iteration  interval 
given  is  20.  We  have  only  one  source,  so  we  take  it  directly  as  the  root  with 
Ai  =  0.  Figure  14(c)  and  (d)  show  the  graph  and  the  shortest-path  tree  (in 
this  case  is  unique).  From  it  we  can  readily  write  the  equations 

x2{k)  =  max{u1(k),x7(k  -  1)} 
x3(fc)  =  max{x2(A:),  x6(&  -  1)} 
xt{k)  =  max{x3(k),x5(k  -  1)} 

'  xs(k)  =  x4(fc) 

x0(/c)  =  min{x3(k),x5(k  -  1)} 
x7(A:)  =  min{x2[k),  x0(fc  —  1)} 
y8(fc)  =  min{ul(k),x7(k  -  1)} 

which  can  be  rearranged  as  follows,  after  substituting  x5(&)  =  xA(k) 
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x2(k)  =  max{ul(k),x1(k  —  1)} 
ys(k)  =  min{ui{k),  x7(k  -  1)} 
x3[k)  =  max{x2{k),  x 6{k  -  1)} 

< 

x7(k)  =  min{x2(k) ,  x6(k  —  1)} 
xt[k)  =  max{xi(k),  x^k  -  1)} 
x6(k)  =  min{x3(k)tXi(k  -  1)} 

These  equations  correspond  to  the  so-called  bubble  sort  (see  Knuth, 
1973).  Other  types  of  sorting  algorithms  could  be  implemented,  see,  for 
instance  Rao  (1985)  and  Lang  et  al.  (1985). 

Back  Substitution 

This  array  was  developed  by  Rung  and  Leiserson  (1979)  and  also  verified 
by  Melhem  and  Rheinboldt  (1984).  Figure  15  depicts  the  system,  the  pro¬ 
cessors’  functions,  the  graph  G,  and  a  shortest-path  tree.  This  time,  we 
have  associated  a  computational  delay  of  10  for  each  of  the  edges  in  the 
graph  G;  this  choice  corresponds  to  the  usual  one  for  systolic  arrays:  all 
the  outputs  to  a  cell  are  produced  at  the  same  time  even  if  some  of  the 
outputs  may  take  less  time  to  compute. 

The  input  sequences  are  as  follows:  zeros  at  vertex  u0,  the  elements 
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of  a  column  vector,  b,  at  vertex  t>i,  and  the  diagonals  of  a  banded,  lower- 
triangular  matrix,  A  at  vertices  ti2-u5  (the  main  diagonal  at  v2,  the  first 
subdiagonal  at  v3l  and  so  on).  The  input  sequences  are  as  follows: 

U1  =  {&1>  ^2)  ^3)  •  •  •} 
ti2  =  {an,  a22,  a33, . . .} 
u3  =  {a21 ,  a32,  a43, . . .} 

ui  =  {a3i,  a42,  ar,3, . . .} 

U5  =  {a41,  a52,  a63, . . .} 

1 

We  are  also  given  the  times,  X,  at  which  the  first  input  is  entered;  they 
are  as  follows 

A i  =  A 2  -  30,  A 3  —  40,  A 4  =  50,  A5  =  60,  Ag  =  0. 

These  times  correspond  to  the  weights  in  the  edges  connecting  the  root 
(not  shown)  and  the  input  vertices. 

The  output  is  at  vertex  v13;  we  can  write  it  as 

yi3  =  (y(i),y(2),y(3),...} 

Figure  15(d)  shows  the  states  in  addition  to  the  shortest  path.  We 


computed  them  using  <54  =  gcd{20, 40, 60}  =  20.  From  this  figure  we  can 
write  the  following  equations 


r* 


I7 (^*-)  Xi2(k  3)  •  u$(k  —  3)  +  u e(^) 

'  x${k)  =  Xn(k  —  2)  •  ut(k  —  2)  +  Xi(k) 
x9(k)  =  xlQ(k  -  1)  •  u3(fc  -  1)  +  xg(fc) 

' 

Xio(k)  =  (tii(A:)  -  x2{k))/u2{k) 

Xu  (k)  =  iio(A:) 

x12(k)  =  iu(fc) 
yn(A:)  =  x12(k) 

Substituting  these  equations  we  get 

yn(fc)  =  {ui(k)  -  x0{k))/u2(k) 

where 

x,(k)  =  y(k—  1)  ■  u3(fc  -  1)  +  y{k  -  2)  •  u A{k  -  2)  +  y(fc  -  3)  •  u5(k  -  3)  +  u8(Jfc) 
Substituting  now  into  the  input  and  output  sequences  we  obtain 

ynW  =  (6fc  -  xs{k))/akk 

where 


Xo(k)  —  y(k  1)  •  at*_,  +  y(k  ~  2)  •  akk_2  +  y(k  —  3)  •  akylt_z 
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These  equations  are  the  well  known  method  of  solving  a  triangular  system, 
called  back  substitution.  Therefore,  the  array  discussed  performs  back 
substitution. 

QR  factorization 

Our  last  example  is  the  Heller-Ipsen  (1983)  array  for  QR  factorization  of 
a  banded  matrix.  A  more  general  study  of  such  arrays  can  be  found  in 
Schreiber  (1983).  The  Heller-Ipsen  array  has  been  verified  by  Tiden  (1984) 
using  a  novel  approach:  he  shows  that  the  system  works  with  one  module 
(‘size’  of  the  system  =  1);  then  he  applies  induction  to  the  size  of  the  array, 
assuming  that  it  works  for  size  n,  and  proves  that  it  works  for  size  n  +  1. 

Figure  16  depicts  the  array  and  the  inputs  for  the  matrix  A;  the  numbers 
indicate  the  subindex  for  the  elements  of  this  matrix.  Figure  17  shows  the 
naming  convention  and  the  coordinate  axis  for  the  variables,  and  gives  the 
graph  G"  for  a  row  of  processors.  Note  the  variations  of  the  graph  at  the 
boundaries  and  in  some  of  the  weights  in  the  subdiagonal  cell. 

All  the  computational  delays  are  one  unit  of  time.  The  iteration  interval 
is  given  as  2,  which  is  valid  because  =  gcd{ 0,2,4}  =  2.  With  the 
information  in  Figure  17  we  can  write  directly  the  equations  for  the  system. 
Note  that  the  first  two  coordinates  indicate  position  and  the  third  one  time. 
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al  Inf ormation  Division 


We  write  below  the  regular  iterative  algorithm  for  a  system  with  an  input 
matrix  A  that  has  N  subdiagonals,  a  main  diagonal,  and  N  superdiagonals. 
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